Dai et al. BMC Systems Biology 2012, 6(Suppl 1):S17 
http://www.biomedcentral.eom/1752-0509/6/S1/S17 

Systems Biology 



RESEARCH Open Access 



Integrating many co-splicing networks to 
reconstruct splicing regulatory modules 

Chao Dai 1,2+ , Wenyuan Li 2+ , Juan Liu 1 , Xianghong Jasmine Zhou 2 * 

From The 5th IEEE International Conference on Computational Systems Biology (ISB 2011) 
Zhuhai, China. 02-04 September 201 1 




Abstract 

Background: Alternative splicing is a ubiquitous gene regulatory mechanism that dramatically increases the 
complexity of the proteome. However, the mechanism for regulating alternative splicing is poorly understood, and 
study of coordinated splicing regulation has been limited to individual cases. To study genome-wide splicing 
regulation, we integrate many human RNA-seq datasets to identify splicing module, which we define as a set of 
cassette exons co-regulated by the same splicing factors. 

Results: We have designed a tensor-based approach to identify co-splicing clusters that appear frequently across 
multiple conditions, thus very likely to represent splicing modules - a unit in the splicing regulatory network. In 
particular, we model each RNA-seq dataset as a co-splicing network, where the nodes represent exons and the 
edges are weighted by the correlations between exon inclusion rate profiles. We apply our tensor-based method 
to the 38 co-splicing networks derived from human RNA-seq datasets and indentify an atlas of frequent co-splicing 
clusters. We demonstrate that these identified clusters represent potential splicing modules by validating against 
four biological knowledge databases. The likelihood that a frequent co-splicing cluster is biologically meaningful 
increases with its recurrence across multiple datasets, highlighting the importance of the integrative approach. 

Conclusions: Co-splicing clusters reveal novel functional groups which cannot be identified by co-expression 
clusters, particularly they can grant new insights into functions associated with post-transcriptional regulation, and 
the same exons can dynamically participate in different pathways depending on different conditions and different 
other exons that are co-spliced. We propose that by identifying splicing module, a unit in the splicing regulatory 
network can serve as an important step to decipher the splicing code. 



Background 

Alternative splicing provides an important means for 
generating proteomic diversity. Recent estimates indicate 
that nearly 95% of human multi-exon genes are alterna- 
tively spliced [1]. The mechanism for regulating alterna- 
tive splicing is still poorly understood, and its 
complexity attributes to the combinatorial regulation of 
many factors, e.g. splicing factors, cis-regulatory ele- 
ments, and RNA secondary structure [2,3]. A fundamen- 
tal task of alternative splicing research is to decipher 
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splicing code and understand the mechanism of how an 
exon is alternatively spliced in tissue-specific manner. 

A central concept in transcription regulation is the 
transcription module, defined as a set of genes that are 
co-regulated by the same transcription factor(s). Analo- 
gously, such coordinated regulation also occurs at the 
splicing level [4-6]. For example, the splicing factor 
Nova regulates exon splicing of a set of genes that shape 
the synapse [6]. However, the study of such coordinated 
splicing regulation has thus far been limited to indivi- 
dual cases [5-9]. In this paper, we define a splicing mod- 
ule as a set of exons that are regulated by the same 
splicing factors. The exons in a splicing module can 
belong to different genes, but they exhibit correlated 
splicing patterns (in terms of being included or excluded 
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in their respective transcripts) across different condi- 
tions, thus form an exon co-splicing cluster. 

The recent development of RNA-seq technology pro- 
vides a revolutionary tool to study alternative splicing. 
From each RNA-seq dataset, we can derive not only the 
expression levels of genes, but also those of exons and 
transcripts (i.e., splicing isoforms). Given an RNA-seq 
dataset containing a set of samples, we can calculate the 
inclusion rate of each exon (In this study we only con- 
sider cassette exons, which are common in alternative 
splicing events. Henceforth, the term "exon" always 
means "cassette exons".) In every sample, as the ratio 
between its expression level and that of the host gene. A 
recent study provided a nice example of studying spli- 
cing regulatory relationships using a network of exon- 
exon, exon-gene, and gene-gene links [10]. Here, we 
construct from each RNA-seq dataset a weighted co-spli- 
cing network where the nodes represent exons and the 
edge weights are correlations between the inclusion 
rates of two exons across all samples in the dataset. 
While directly comparing the inclusion rates for the 
same exon in different datasets could be biased by plat- 
forms and protocols, the correlations between inclusion 
rates for a given exon pair are comparable across data- 
sets. From a series of RNA-seq datasets, we can there- 
fore derive a series of co-splicing networks, which can 
be subjected to comparative network analysis and pro- 
vide an effective way to integrate a large number of 
RNA-seq experiments conducted in different labora- 
tories and using different technology platforms. 

A heavy subgraph in a weighted co-splicing network 
represents a set of exons that are highly correlated in 
their inclusion rate profiles; i.e., they are co-spliced. A 
set of exons which frequently form a heavy subgraph in 
multiple datasets are likely to be regulated by the same 
splicing factors, and thus form a splicing module. We 
call such patterns frequent co-splicing clusters. Due to 
the enhanced signal to noise separation, frequent clus- 
ters are more robust and are more likely to be regulated 
by the same splicing factors (thus more likely to repre- 
sent splicing modules) than those heavy subgraphs 
derived from a single dataset. In our previous research 
[11], we showed that the likelihood for a gene co- 
expression cluster to be a transcription module increases 
significantly with the recurrence of clusters in multiple 
datasets. A similar principle applies to splicing modules. 

In this paper, we adopt our recently developed tensor- 
based approach to find the heavy subgraph that fre- 
quently occur in multiple weighted networks [12]. Our 
goal here is to identify co-spliced exon clusters that fre- 
quently occur across multiple weighted co-splicing net- 
works. A co-splicing network of n nodes (exons) can be 
represented as an n x n adjacency matrix A, where ele- 
ment dij is the weight of the edge between nodes i and 



/. This weight represents the correlation between the 
two exons' inclusion rate profiles. Given m co-splicing 
networks with the same n nodes but different edge 
weights, we can represent the whole system as a 3 rd - 
order tensor (or 3-dimensional array) of size n x n x m. 
An element a^ of the tensor is the weight of the edge 
between nodes i and / in the k th network (Figure 1). A 
co-splicing cluster appears as a heavy subgraph in the 
co-splicing network, which in turn corresponds to a 
heavy region in the adjacency matrix. A frequent co-spli- 
cing cluster is one that appears in multiple datasets, and 
appears as a heavy region of the tensor (Figure 1). Thus, 
the problem of identifying frequent co-splicing clusters 
can intuitively be formulated as the problem of identify- 
ing heavy subtensors in a tensor. By representing net- 
works and formulating the problem in this tensor form, 
we gain access to a wealth of established optimization 
methods for multidimensional arrays. Reformulating a 
discrete graph discovery problem as a continuous opti- 
mization problem is a longstanding tradition in graph 
theory. There are many successful examples, such as 
using a Hopfield neural network to solve the traveling 
salesman problem [13] and applying the Motzkin-Straus 
theorem to the clique-finding problem [14]. Moreover, 
when a graph-based pattern mining problem is trans- 
formed into a continuous optimization problem, it 
becomes easy to incorporate constraints representing 
prior knowledge. Finally, advanced continuous optimiza- 
tion techniques require very few ad hoc parameters, in 
contrast with most heuristic graph combinatorial 
algorithms. 

We applied our tensor algorithm to 38 weighted 
exon co-splicing networks derived from human RNA- 
seq datasets. We identified an atlas of frequent co- 
splicing clusters and validated them against four biolo- 
gical knowledge bases: Gene Ontology annotations, 
RNA-binding motif database, 191 ENCODE genome- 
wide ChlP-seq profiles, and protein complex database. 
We demonstrate that the likelihood for an exon clus- 
ter to be biologically meaningful increases with its 
recurrence across multiple datasets, highlighting the 
benefit of the integrative approach. Moreover, we 
show that co-splicing clusters can reveal novel func- 
tional groups that cannot be identified by co-expres- 
sion clusters. Finally, we show that the same exons 
can dynamically participate in different pathways, 
depending on different conditions and different other 
exons that are co-spliced. 

Results 

We identified 38 human RNA-seq datasets from the 
NCBI Sequence Read Archive (http://www.ncbi.nlm. 
nih.gov/sra) (see additional file 1: Section S6 to get 
details of these datasets), each with at least 6 samples 
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Figure 1 Illustration of the 3 rd -order tensor representation of a collection of networks A collection of co-splicing networks can be 
"stacked" into a third-order tensor such that each slice represents the adjacency matrix of one network. The weights of edges in the co-splicing 
networks and their corresponding entries in the tensor are color-coded according to the scale to the right of the figure. After reordering the 
tensor by the exon and network membership vectors, a frequent co-splicing cluster (colored in red) emerges in the top-left corner. It is 
composed of exons A, 8, C, D which are heavily interconnected in networks 1, 2, 3. 



providing transcriptome profiling under multiple 
experimental conditions, such as diverse tissues or var- 
ious diseases. For each dataset, we used Tophat [15] 
tool to map short reads to the hgl9 reference genome 
and applied the transcript assembly tool Cufflinks [16] 
to estimate expressions for all transcripts with known 
UCSC transcript annotations [17]. We calculated the 
inclusion rate of each exon, as the ratio between its 
expression (the sum of FPKM (FPKM stands for "Frag- 
ments Per Kilobase of exon per Million fragments 
mapped", as defined in [16].) over all transcripts that 
cover the exon) and the host gene's expression (the 
sum of FPKM over all transcripts of the gene). It is 
worth noting that in RNA-seq experiments, a gene 
expression with low FPKM is usually not precisely esti- 
mated because the number of reads mapped to the 
gene is quite small. In order to work with reasonably 
accurate estimates of exon inclusion rates, as pointed 
out by [18], we calculated inclusion rates only for 
those genes whose expressions are above 80 th percen- 
tile across at least 6 samples. This criterion resulted in 
inclusion rate profiles for 16,024 exons covering 9,532 
genes. Based on these profiles, we constructed an exon 
co-splicing network from each RNA-seq dataset by 
using Pearson's correlation between exons' inclusion 
rate profiles. Details of data processing refer to addi- 
tional file 1 (Section S5). 

We applied our method to 38 RNA-seq datasets gen- 
erated under various experimental conditions. Adopting 
the empirical criteria of "heaviness" > 0.4 and cluster 
size >5 exons, we identified 7,194/3,104/1,422/594 co- 
splicing clusters with recurrences >3/4/5/6. 



Frequent co-splicing clusters are likely to represent 
functional modules, splicing modules, transcriptional 
modules, and protein complexes 

To assess the biological significance of the identified 
patterns, we evaluate the extent to which these exon 
clusters represent functional modules, splicing modules, 
transcriptional regulatory modules, and protein com- 
plexes. Due to the difference of background "gene" 
numbers, we set different p-value thresholds for signifi- 
cance test. 
Functional analysis 

We evaluated the functional homogeneity of the host 
genes in an exon cluster using Gene Ontology (GO) 
annotations. To ensure the specificity of GO terms, we 
filtered out general GO terms associated with >300 
genes. If the host genes of exons in a cluster are statisti- 
cally enriched in a GO term with p-vahie < 1E-4 (based 
on the hypergeometric test), we declare the exon cluster 
to be functionally homogeneous. We found that 23.3% 
of clusters appearing in >3 datasets are functionally 
homogenous, compared to only 6.0% of randomly gener- 
ated clusters with the same sizes. This enrichment fold 
ratio of 3.9 between real and random patterns demon- 
strates the strong biological relevance of the identified 
patterns. Furthermore, the enrichment fold ratio 
increases with its recurrence of FSCs (shown in Figure 
2A). For example, when the FSCs are required to recur 
in at least 5 datasets, their enrichment fold ratio com- 
pared to random patterns increases to 4.4, confirming 
the benefits of the integrative analysis of multiple RNA- 
seq datasets in improving the quality of detected pat- 
terns. Functionally homogenous clusters cover a wide 
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range of post-transcriptional associated GO terms, such 
as "RNA splicing", "ribonucleoprotein binding", "hetero- 
geneous nuclear ribonucleoprotein complex", "negative 
regulation of transcription from RNA polymerase II pro- 
moter", and "cellular protein localization". 
Splicing regulatory analysis 

By construction, the exons in our identified co-splicing 
clusters have highly correlated inclusion rate profiles 
across different experimental conditions. Clusters meet- 
ing this criterion are likely to consist of exons co-regu- 
lated by the same splicing factors. It has been shown 
that splicing factors can affect alternative splicing by 
interacting with cis-regulatory elements in a position- 
dependent manner [19]. We collected the experimental 
RNA target motifs (2220 RNA binding sites) of 62 spli- 
cing factors from the SpliceAid2 database [20]. To iden- 
tify possible splicing factors associated with a co-splicing 
cluster, for each exon of a co-splicing cluster, we 
retrieved the internal exon region and its 50bp flanking 
intron region which are enriched in the motifs of those 
62 splicing factors by performing BLAST search (E- 
score < 0.001). If the exons of a cluster are highly 
enriched in the targets of a splicing factor, we consider 
the cluster to be "splicing homogeneous". Although the 
collection of known splicing motifs is very limited, at 
the p-val\ie < 0.05 level (based on hypergeometric test), 
we still observed that 4.9% clusters with >5 exons and 
>6 recurrences are splicing homogenous, compared to 
1.6% of randomly generated patterns with the same size 
distribution. Its enrichment fold ratio is 3.0. Performing 
the same analysis for less frequent clusters, we found 
that as the recurrence increases, so does the enrichment 
fold ratio (Figure 2B). The five most frequently enriched 
splicing factors are hnRNP E2, 9G8, hnRNP U, SRp7S 
and SRp30c. hnRNP E2 and hnRNP U both belong to 
heterogeneous nuclear ribonu-cleoprotein family, which 
generally suppress splicing through binding to exonic 
splicing silencer [2]. Studies show that hnRNP E2 can 
repress exon usage when present at high levels in vitro 
[21], and hnRNP U bind pre-mRNA as well as nuclear 
mRNA and play an important role in processing and 



transport of mRNA [22]. 9G8, SRp75 and SRp30c all 
belong to the SR family of splicing regulators. 9G8 pro- 
tein excluding other SR factors can rescue the splicing 
activity of a 9G8-depleted nuclear extract, indicating 
9G8 plays a crucial role in splicing [23]. SRp75 is pre- 
sent in messenger ribonucleoproteins in both cycling 
and differentiated cells, and shuttles between nucleus 
and cytoplasm, implicating its widespread roles in spli- 
cing regulation [24]. SRp30c can function as a repressor 
of 3' splice site utilization and SRp30c-CE9 interaction 
may contribute to the control of hnRNP Al alternative 
splicing [25]. 

We found that some splicing factors tend to co-bind 
to the cis-regulatory regions of exons in a co-splicing 
cluster, suggesting the combinatorial regulation of those 
splicing factors. Trans-acting SR proteins Tra2cc and 
SRp30c are simultaneously enriched in 18 clusters (with 
recurrence >3), whose major functions (by GO term 
enrichment) are related to post-transcriptional regula- 
tion, such as "ribonucleoprotein binding" (p-value = 
2.11E-5), "nuclear mRNA splicing, via spliceosome" {p- 
value = 7.66E-5), "RNA export from nucleus" (p-vahie = 
4.81E-5), and "translational initiation" (p-value = 2.48E- 
5). Current study indeed shows that there is a coopera- 
tive interaction between Tra2cc and SRp30c in exonic 
splicing enhancer dependent GnRG pre-mRNA splicing 
[26]. Splicing regulators SRp20, SRp30c and SRp7S are 
simultaneously enriched in 2 clusters (with recurrence 
>3), which are also associated with post-transcriptional 
regulation. For example, "RNA splicing" (p-value = 
3.25E-6), "translation initiation factor activity" (p-value = 
7.42E-5), and "eukaryotic translation initiation factor 3 
complex" (p-value = 2.17E-4). Our results suggest that 
combinatorial splicing regulation can occur in post-tran- 
scriptional processes. 
Transcriptional and epigenomic analysis 
To evaluate how co-splicing is affected by transcrip- 
tional regulation, we used 191 ChlP-seq profiles gener- 
ated by the Encyclopedia of DNA Elements (ENCODE) 
consortium [27]. This dataset includes the genome-wide 
bindings of 40 transcription factors (TF), 9 histone 
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Figure 2 Evaluation of the functional, splicing, transcriptional, and protein complex homogeneity of co-splicing clusters with different 
recurrences. Four types of databases are used: (A) Gene Ontology for functional enrichment, (B) SpliceAid2 database for splicing enrichment, 
(C) ENCODE database for transcriptional and epigenetic enrichment, and (D) CORUM database for protein complex enrichment. The x-axis is 
network recurrence and y-axis is enrichment fold ratio, calculated by dividing the percentage of enriched clusters by the percentage of enriched 
random clusters. 
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modification marks, and 3 other markers (DNase, 
FAIRE, and DNA methylation) on 25 different cell lines. 
For a detailed description of the signal extraction proce- 
dure, see the additional file 1 (Section S7). If the host 
genes of an exon cluster are highly enriched in the tar- 
gets of any regulatory factor, we consider the cluster to 
be "transcription homogenous". At the significance level 
/7-value <0.01, 74.9% clusters with recurrences >3 are 
transcription homogenous, compared to only 21.2% of 
randomly generated clusters with the same sizes. As 
expected, the enrichment fold ratio increases with recur- 
rence (Figure 2C). This result suggests a strong associa- 
tion between transcription and splicing. The four most 
frequently enriched regulatory factors are TAF8, GABP, 
FOS and NFYB. TAF8 is a subunit of transcription 
initiation factor TFIID, which is required for accurate 
and regulated initiation by RNA polymerase II [28]. As 
an ETS transcription factor, GABP plays a key role in 
regulating genes which are intimately involved in cell 
cycle control, protein synthesis and cellular metabolism 
[29]. FOS can dimerise with c-Jun to form AP-1 tran- 
scription factor, which upregulates transcription of a 
wide range of genes involved in proliferation and differ- 
entiation to defense against invasion and cell damage 
[30]. NFYB is a subunit of an ubiquitous heteromeric 
transcription factor NF-Y, which regulates 30% of mam- 
malian promoters [31]. 
Protein complex analysis 

We evaluate the extent to which host genes of our iden- 
tified exon clusters are protein complexes by using the 
Comprehensive Resource of Mammalian protein com- 
plexes database (CORUM, September 2009 version) 
[32]. At the significance level p-value <0.05, 18.1% of 
co-splicing clusters with recurrences >3 are enriched in 
genes belonging to a protein complex, versus only 0.7% 
of randomly generated clusters with the same sizes. The 
enrichment fold ratio for protein complexes increases 
with the cluster recurrence (Figure 2D). The five most 
frequently enriched protein complexes are "Spliceo- 
some", "CCT micro-complex", "large Drosha complex", 
'7Vbj?56p-associated pre-rRNA complex", and "C com- 
plex spliceosome". At least 1/3 of subunits in the 
enriched complex "large Drosha complex" contain pro- 
teins associated with splicing function, especially hetero- 
geneous nuclear ribonucleoproteins such as HNRNPH1, 
HNRNPM, HNRNPU, HNRNPUL1 and HNRNPDL [32], 

Co-splicing clusters reveal novel functions that are not 
identified by co-expression clusters 

Studies have shown that genes that are co-regulated 
transcriptionally do not necessarily overlap with those 
that are co-spliced [33]. Therefore, the identification of 
co-splicing clusters can reveal functionally related genes 
that could not be discovered from transcription analysis. 



In order to identify novel functions associated with co- 
splicing but not co-expression, we complement the 
above analysis by constructing a gene co-expression net- 
work from each RNA-seq dataset. The nodes of these 
networks represent genes, and the edges are weighted 
by Pearson's correlation between two gene expression 
profiles. We then apply our tensor-based pattern mining 
algorithm to identify frequent co-expression clusters in 
the 38 co-expression networks with the same criteria as 
those of identifying co-splicing clusters. The same func- 
tional enrichment analysis described above for co-spli- 
cing clusters was performed on the resulting co- 
expression clusters. We found that 98.8% of co-splicing 
clusters with recurrences > 3 have low expression corre- 
lations (average correlations < 0.2). Therefore, many of 
the functions associated with post-transcriptional regula- 
tion are enriched in co-splicing clusters but not in co- 
expression clusters. These functions include "mainte- 
nance of protein location", "regulation of protein cata- 
bolic process", "cytoplasmic sequestering of protein", 
"regulation of intracellular protein transport", "regula- 
tion of ubiquitin-protein ligase activity", "ribonucleopro- 
tein complex assembly", "RNA splicing, via 
transesterification reactions", and "RNA export from 
nucleus". 

For example, one co-splicing cluster has seven host 
genes: HNRNPUL1, HNRNPC, DHX9, BAT1, PSMA5, 
RAD23 and RPS9. This cluster cannot be found from 
co-expression data, for the expression profiles of the 
host genes have low correlations. However, this set of 
host genes is enriched with several splicing associated 
functions, including "RNA splicing" (p-value = 1.89E-6) 
and "RNA helicase activity" (p-value = 4.68E-5). Out of 
seven host genes, HNRNPUL1 and HNRNPC belong to 
heterogeneous nuclear ribonucleoprotein family, which 
generally suppress splicing through binding to exonic 
splicing silencer [2]. DHX9, known as RNA helicase A, is 
a highly conserved DEAD-box protein that activates 
transcription, modulates RNA splicing, binds the nuclear 
pore complex and involves in spliceosome assembly 
[34,35]. Previous research illustrated that DHX9 med- 
iates association of CBP and RNA polymerase II [36], 
and current study further shows that DHX9 interacts 
with post-transcriptional control element RNA in the 
nucleus and cytoplasm to facilitate efficient translation 
[34]. Interestingly, HNRNPC and DHX9 are indeed 
tightly functionally associated: silencing of DHX9 ser- 
iously disturbed the nuclear distribution of the hnRNP 
C protein [37]. As an essential splicing factor, BAT1 
also belongs to DEAD-box protein family, and plays an 
important role in mRNA export from the nucleus to the 
cytoplasm, supported by recent experimental evidence 
that knocked down BAT1 induces spliced mRNA, as 
well as total polyA RNA accumulating in nuclear 
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speckle domains, not exporting to the cytoplasm [38]. 
Clearly, co-splicing clusters can provide complementary 
information on functionally related gene groups in addi- 
tion to co-trancriptional clusters. In particular, co-spli- 
cing clusters can grant new insights into functions 
associated with post-transcriptional regulation. 

Exons can dynamically participate in different pathways 
upon different co-splicing mechanisms 

Alternatively skipping or including a cassette exon can 
change the functions of a protein by deleting or insert- 
ing a protein domain. In other words, protein isoforms 
alternatively spliced from the same gene may participate 
in different pathways. In our results, we observed that 
70.3%/52.3%/38.3%/27.1% of exons are members of at 
least two clusters (recurrence>3/4/5/6) with different 
functions. For example, exon8 of the gene Rela appears 
in four co-splicing clusters with recurrences >3, which 
are enriched with the following distinct functions 
respectively: "ER-associated protein catabolic process" 
(/3-value = 2.20E-5), "response to extracellular stimulus" 
Qc-value = 3.80E-5), "regulation of gene-specific tran- 
scription" (^-value = 8.89E-5), and "positive regulation 
of intracellular protein kinase cascade" {p-vahie = 2.49E- 
5). Rela encodes the transcription factor p65, which is 
an important subunit of the NF-kB complex that affects 
several hundred genes by NF-kB signaling. Recent 
research has identified several alternative splice variants 
of Rela, e.g. p65A, p65A2 and p65A3. In fact, p65A 
arises by the use of an alternative splice site located 30 
nucleotides into exon8, and p6SA3 was identified as a 
splice variant lacking exon7 and exon8 [39]. These facts 
are consistent with our finding that exon8 is dynami- 
cally included in multiple co-splicing clusters. As 
another example, exon2 of the gene EIF5 appears in 
three co-splicing clusters with recurrences >3, which are 
enriched with following distinct functions respectively: 
"RNA splicing" (p-value = 6.27E-5), "mRNA polyadeny- 
lation" (p-value = 1.57E-5), and "regulation of transla- 
tional initiation" {p-value = 8.18E-5). As a translation 
initiation factor, EIF5 plays critical roles for the accurate 
recognition of correct start codon during translation 
initiation [40]. Our result suggests that except for trans- 
lation initiation regulation, EIF5 may also involve in 
post-transcriptional regulation, such as RNA splicing 
and mRNA polyadenylation by dynamically including 
exon2 in multiple co-splicing clusters. These examples 
demonstrate that exons can contribute to different func- 
tionalities of proteins depending on different splicing 
regulatory mechanisms. 

Conclusions 

Splicing code is determined by a combination of many 
factors, such as cis-regulatory elements and transacting 



factors. If some exons share the same splicing code, they 
may form a splicing module: a unit in the splicing regu- 
latory network. Therefore, identifying co-splicing clus- 
ters first and then investigating their cis-regulatory 
elements and associated trans-acting factors can serve as 
an important step to decipher the splicing code. Our 
tensor-based approach can identify co-spliced exon clus- 
ters that frequently appear in multiple RNA-seq data- 
sets. The exons in a frequent co-splicing cluster can 
belong to different genes, but are very likely to be co- 
regulated by the same splicing factors, thus forming a 
splicing module. We demonstrated that the identified 
clusters represent meaningful biological modules, i.e. 
functional modules, splicing modules, transcriptional 
modules, and protein complexes, by validating against 
four biological knowledge databases. In all four types of 
enrichment results, the likelihood that a co-splicing 
cluster is biologically meaningful increases with its 
recurrence. This consistent behavior highlights the 
importance of the integrative approach. We also showed 
that the co-splicing clusters can reveal novel functional 
related genes that cannot be identified by co-expression 
clusters, and that the same exons can dynamically parti- 
cipate in different pathways depending on different con- 
ditions and different other exons that are co-spliced. 
The NCBI Sequence Read Archive database currently 
stores 8099 RNA-seq profiles, and this number is 
expected to dramatically increase in the near future. We 
expect to apply our approach to the rapidly accumulat- 
ing RNA-seq data of multiple organisms, and to identify 
a large number of splicing modules and their associated 
phenotype conditions. This analysis can serve as a first 
step towards the reconstruction of tissue- and disease- 
specific splicing regulatory networks. 

Methods 

Given an RNA-seq dataset, we construct a co-splicing 
network where nodes represent exons and edges are 
weighted by the correlation between two exon inclusion 
rate profiles. Given m co-splicing networks with the 
same n nodes but different edge weights, we can repre- 
sent the whole system as a 3 rd -order tensor 
A = {aijk)nxnxm- A frequent co-splicing cluster (FSC) in 
the tensor A can be defined by two membership vec- 
tors: (i) the exon membership vector x = (x\, x„) T , 
where x t = 1 if exon i belongs to the cluster and x t = 0 
otherwise; and (ii) the network membership vector y = 
(jii ••• > y m ) T i where y> = 1 if the exons of the cluster are 
heavily interconnected in network / and yj = 0 other- 
wise. The summed weight of all edges in the FSC is 

^ n n m 

h a(x, y) = - a vkXiXjyk- (l) 

,=i j=i k=i 
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Note that only the weights of edges with x t = x> = 
jk = 1 are counted in H4 . Thus, H^(x,y) measures 
the "heaviness" of the FSC defined by x and y. The pro- 
blem of discovering a frequent co-splicing cluster can be 
formulated as a discrete combinatorial optimization pro- 
blem: among all patterns of fixed size (K± member exons 
and K 2 member networks), we look for the heaviest. This 
is also an integer programming problem: find the binary 
membership vectors x and y that jointly maximize H4 
under the constraints Ya=\ x < = ^1 anc ' Ylf=i Yj = Ki . 
However, there are several major drawbacks to this dis- 
crete formulation. 

The first is parameter dependence, meaning that the 
size parameters Ki and K 2 are hard for users to provide 
and control. The second is high computational complex- 
ity; the task is proved to be NP-hard (see additional file 
1: Section SI) and therefore not solvable in a reasonable 
time even for small datasets. Therefore, the discrete 
optimization problem is infeasible for an integrative ana- 
lysis of many massive networks. Instead, we solve a con- 
tinuous optimization problem with the same objective 
by relaxing integer constraints to continuous con- 
straints. That is, we look for non-negative real vectors x 
and y that jointly maximize H4 . This optimization pro- 
blem is formally expressed as follows: 

maXxei^ygRm H A (x,y) 
subject to/(x) = 1 and g(y) = 1 

where R + is a non-negative real space, and fx) and g 
(y) are vector norms. After solving Eq. (2), users can 
easily identify the top-ranking networks (after sorting 
the tensor by y) and top-ranking exons (after sorting 
each network by x) contributing to the objective func- 
tion. After rearranging the networks in this manner, the 
FSC with the largest heaviness occupies a corner of the 
3D tensor. We can then mask all edges in the heaviest 
FSC with zeros, and optimize Eq. (2) again to search for 
the next FSC. 

The choice of vector norms in Eq. (2) has a significant 
impact on the outcome of the optimization. A vector 
norm defined as ||x|| p = (XX 1 |x;| p ) 1 / p , where p > 0, is 
also called an "Z^-vector norm". In general, the closer p 
is to zero, the sparser the solution favored by the L p - 
norm; that is, fewer components of the optimized vec- 
tors are significantly different from zero [41]. In con- 
trast, as p increases, the solution favored by the L p - 
norm grows smoother; in the extreme case p — > °°, the 
elements of the optimized vector are approximately 
equal to each other. For more details on these vector 
norms, refer to the additional file 1: Section S2. Our 
ideal membership vector selects a small number of 
exons ("sparse") whose values are close to each other in 
magnitude ("smooth"), while the rest of exons are close to 



zero. Our past research [12] has shown that this goal 
can be achieved using the mixed norm L 0oa (x) = a||x|| 0 
+ (1 - a) Hxl^ (0 <a < 1) for fix). The norm L 0 favors 
sparsity while the norm encourages smoothness in 
the non-zero components of x. In practice, we approxi- 
mate Z, 0 ,oo(x) with another mixed norm: L p2 {x) = <x\x\ p 
+ (1 - a)||x|| 2 , where p < 1. Our criteria for the network 
membership vector are similar. We want the exon clus- 
ter to appear in as many networks as possible, so the 
network membership values should be non-zero and 
close to each other. This is the typical outcome of opti- 
mization using the L x norm. In practice, we approxi- 
mate I„ with L q {y), where q > 1 for g(y). Therefore, the 
vector norms fix) and g(y) are fully specified as follows, 

/(x) = orllxllp + (1 - a)||x|| 2 and g(y) = fyf^ (3) 

We performed simulations to determine suitable 
values for the parameters p, a, and q, by applying our 
tensor method to collections of random weighted net- 
works. We randomly placed FSCs of varying size, recur- 
rence, and heaviness in a subset of the random 
networks. "We then tried different combinations of p, a, 
and q, and adopted the combination (p = 0.8, a = 0.2, 
and q = 10) that led to the discovery of the most FSCs. 
More details on these simulations are provided in the 
additional file 1 (Section S4). 

Since the vector norm fix) is non-convex, our tensor 
method requires an optimization protocol that can 
deal with non-convex constraints. The quality of the 
optimum discovered for a non-convex problem 
depends heavily on the numerical procedure. Standard 
numerical techniques such as gradient descent con- 
verge to a local minimum of the solution space, and 
different procedures often find different local minima. 
Thus, it is important to find a theoretically justified 
numerical procedure. We use an advanced framework 
known as multi-stage convex relaxation, which has 
good numerical properties for non-convex optimiza- 
tion problems [41]. In this framework, concave duality 
is used to construct a sequence of convex relaxations 
that give increasingly accurate approximations to the 
original non-convex problem. We approximate the 
sparse constraint function fix) by the convex function 
/ v (x) = v T h(x) —fh(v), where h{x) is a specific convex 
function h(x) = x 2 and (v) is the concave dual of 
the function fh(v) (defined as f[v) =f h (h(v)))- The 
vector v contains coefficients that will be automatically 
generated during the optimization process. After each 
optimization, the new coefficient vector v yields a con- 
vex function / v (x) that more closely approximates the 
original non-convex function fix). Details of our ten- 
sor-based optimization method can be found in the 
additional file 1 (Section S3). 
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Once the membership vectors (i.e., the solution of Eq. 
(2)) have been found by optimization, the frequent co-spli- 
cing clusters can be intuitively obtained by including those 
exons and networks with large membership values. How- 
ever, any given solution can result in multiple overlapping 
patterns whose "heaviness" is greater than a specified 
threshold. Here, heaviness is defined as the average weight 
of all edges in the pattern. To identify the most represen- 
tative pattern, we first rank exons and networks in 
decreasing order of their membership values in x and y. 
Then we extract two representative patterns that satisfy 
the heaviness threshold: the pattern that occurs in the 
most networks while having at least the minimum number 
of top-ranking exons (e.g., 5), and the pattern with the lar- 
gest number of top-ranking exons while appearing in at 
least the minimum number of top-ranking networks (e.g., 
3). Both patterns are included as co-splicing clusters in 
our results. After discovering a pattern, we can mask its 
edges in those networks where they occur (replacing those 
elements of the tensor with zeroes) and optimize Eq. (2) 
again to search for the next frequent co-splicing cluster. 

Additional material 



Additional file 1: Supplementary material. Additional file provides 
supplementary material which gives details of data processing and 
methods. 
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